Crime is an inseparable part of life in any society: we hear about it constantly, and some of us are directly affected by it. Growing populations, technological change, and intensifying competition for economic resources have given rise to a range of new social problems that demand a response. The protection of life and property remains the primary goal of law enforcement.
We can use modern technology and data-science techniques to act intelligently against this problem. Crime data analysis lets law enforcement officials objectively characterize criminal activity and develop directed patrolling and tactical action plans to combat it effectively. The same analysis also helps ensure that officials put their limited resources to the best possible use.
Police departments have accumulated a large number of records and documents over the years, and these can serve as a valuable source of data for analysis. Such analysis benefits not only law enforcement agencies but also other fields, such as real estate, where the crime rate of an area can affect property prices.
In today's world all of us give security a high priority, so with this analysis we hope to lend a helping hand in making everyone feel safe.
GOAL: Help law enforcement officials use their limited resources in an efficient manner
QUESTIONS
1. What time of the day is the most patrolling required?
2. Which area in the city is the most dangerous?
Let's start by loading the packages we require for the project.
import numpy as np
import pandas as pd
import pandas_profiling
from pandas import *
import os
import csv
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from scipy import stats
sns.set_style("darkgrid")
import matplotlib.image as mpimg
from IPython.display import IFrame
import warnings
warnings.filterwarnings("ignore")
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin
from sklearn.model_selection import cross_validate as cv
from sklearn.model_selection import train_test_split
import folium
from folium import plugins
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
import chart_studio.plotly as py
import pylab as pl
from mpl_toolkits.mplot3d import Axes3D
from bubbly.bubbly import bubbleplot
from plotly.graph_objs import Scatter, Figure, Layout
cdata=pd.read_csv("/Users/aishwaryanambiar/Documents/IDS Project/Crimes_-_2001_to_present.csv", iterator=True, chunksize=100000)
crime_data = pd.concat(cdata, ignore_index=True)
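The chunked read above keeps memory bounded while the full frame is assembled at the end. A minimal sketch of the same pattern, with a tiny in-memory CSV standing in for the multi-gigabyte crimes file (the rows here are illustrative):

```python
import io
import pandas as pd

# A small CSV in memory stands in for the real Crimes_-_2001_to_present.csv
csv_text = "ID,Year,Primary Type\n1,2002,THEFT\n2,2003,BATTERY\n3,2004,ROBBERY\n"

# chunksize makes read_csv return an iterator of DataFrames (<= 2 rows each)
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
frame = pd.concat(reader, ignore_index=True)
```

With the real file, only one chunk at a time is parsed before concatenation.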
Printing the shape of the data
crime_data.shape
Checking the dataset for its different columns
for col in crime_data.columns:
print(col)
Showing the first five rows of the dataframe crime_data
crime_data.head()
Checking the different crimes in the column Primary Type
crimes = crime_data['Primary Type'].sort_values().unique()
crimes, len(crimes)
The different Ward codes can be found here: https://www.chicago.gov/city/en/about/wards.html. From this reference we know that there are 50 wards.
crimes = crime_data['Ward'].sort_values().unique()
crimes, len(crimes)
IUCR stands for Illinois Uniform Crime Reporting; it encodes the nature of each crime using a specific code table. The list of IUCR codes for different crimes can be found at https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e/data.
crimes = crime_data['IUCR'].sort_values().unique()
crimes, len(crimes)
The different District codes can be found here: https://home.chicagopolice.org/community/districts/.
crimes = crime_data['District'].sort_values().unique()
crimes, len(crimes)
From our reference we know that there are 25 districts, so the value 31 is an error. Districts 13 and 23 are not listed.
crimes = crime_data['Year'].sort_values().unique()
crimes, len(crimes)
Showing the Map of Chicago divided by District
plt.figure(figsize=(10,18))
img = mpimg.imread('/Users/aishwaryanambiar/Documents/ABI Project/chicago_map1.png')
plt.imshow(img)
As the figure above shows, there are 25 districts in Chicago, yet the dataset contains a district 31, which does not exist. We can conclude that it is an error, so we drop that value from the dataset to avoid discrepancies in our analysis caused by incorrect data.
crime_data = crime_data[crime_data['District'] != 31]
crimes = crime_data['District'].sort_values().unique()
crimes, len(crimes)
Dropping the years 2001 and 2019: as we moved further into the analysis we found that most of the NULL values fell in 2001, and including 2019 would skew the results because that year is not yet over. The analysis therefore covers the data from 2002-2018.
crime_data = crime_data[crime_data['Year'] != 2001]
crime_data = crime_data[crime_data['Year'] != 2019]
crimes = crime_data['Year'].sort_values().unique()
crimes, len(crimes)
For the purposes of analysis and cleaning, the crimes need to be classified into different categories. The categories I will use are as follows:
Violent Crimes
The following are the different crimes in this category
Selecting the different columns we need for the analysis and classifying the different crimes.
col2 = ['ID','Year','Date','Primary Type','Arrest','Domestic','District','Ward','IUCR','X Coordinate','Y Coordinate','Latitude','Longitude','Location','Location Description']
violent_crimes = crime_data[col2]
violent_crimes = violent_crimes[violent_crimes['Primary Type']\
.isin(['HOMICIDE','ASSAULT','BATTERY','CRIM SEXUAL ASSAULT', 'DOMESTIC VIOLENCE', 'HUMAN TRAFFICKING', 'KIDNAPPING', 'ROBBERY'])]
# clean some rogue (0,0) coordinates
violent_crimes = violent_crimes[violent_crimes['X Coordinate']!=0]
violent_crimes.head()
Using some visualizations to see how the different crimes are distributed throughout the city.
g = sns.lmplot(x="X Coordinate",
y="Y Coordinate",
col="Primary Type",
data=violent_crimes.dropna(),
col_wrap=2, height=6, fit_reg=False,
sharey=False,
scatter_kws={"marker": "D",
"s": 10})
Exploring the data for cleaning
Below we are checking the attributes of our dataframe
violent_crimes.info()
Checking for Unique values
print(violent_crimes.apply(lambda x: len(x.unique())))
Now we will check the data for null values
violent_crimes.isnull().sum()
The attributes related to location contain many null values, which makes location-based analysis difficult. The District attribute, however, has comparatively few nulls, so the analysis can be done on the basis of the different districts of the city of Chicago. Hence we will drop all the null values.
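The comparison above can be made quantitative: `isnull().mean()` gives the fraction of missing values per column, a quick way to confirm that the location columns are far sparser than District. A toy sketch (the values are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "District": [1.0, 2.0, 3.0, np.nan],
    "Latitude": [41.8, np.nan, np.nan, np.nan],
})
null_frac = toy.isnull().mean()  # per-column fraction of missing values
```

Here District is 25% missing while Latitude is 75% missing, so district-level analysis loses far fewer rows.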
Dropping the null values
violent_crimes_clean = violent_crimes.dropna()
violent_crimes_clean.isnull().sum().sum()
Analysing all the violent crimes per district
violent_crimes_clean = violent_crimes_clean.loc[(violent_crimes_clean['X Coordinate']!=0)]
sns.lmplot(x='X Coordinate',
y='Y Coordinate',
data=violent_crimes_clean,
fit_reg=False,
hue="District",
palette='Dark2',
height=12,
scatter_kws={"marker": "D",
"s": 10})
ax = plt.gca()
ax.set_title("All Violent Crimes (2002-2018) per District")
Doing some date time processing
violent_crimes_clean['Date'] = pd.to_datetime(violent_crimes_clean.Date)
violent_crimes_clean['date'] = [d.date() for d in violent_crimes_clean['Date']]
violent_crimes_clean['time'] = [d.time() for d in violent_crimes_clean['Date']]
violent_crimes_clean['time'] = violent_crimes_clean['time'].astype(str)
empty_list = []
for timestr in violent_crimes_clean['time'].tolist():
ftr = [3600,60,1]
var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
empty_list.append(var)
violent_crimes_clean['seconds'] = empty_list
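The Python loop above can be replaced by a vectorized expression using the `.dt` accessor, which computes the same seconds-since-midnight column without converting times to strings. A sketch on two hypothetical timestamps in the dataset's date format:

```python
import pandas as pd

dates = pd.to_datetime(
    pd.Series(["01/02/2018 01:30:00 PM", "01/02/2018 12:00:05 AM"]),
    format="%m/%d/%Y %I:%M:%S %p",
)
# seconds since midnight, computed column-wise instead of in a Python loop:
# 13:30:00 -> 13*3600 + 30*60 = 48600; 12:00:05 AM -> 5
seconds = dates.dt.hour * 3600 + dates.dt.minute * 60 + dates.dt.second
```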
Analysing all the Violent Crimes Yearly
violent_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Violent Crimes')
plt.show()
From the figure above we can conclude that the violent crimes were highest in the year 2003 and lowest in the year 2015.
Analysing the Arrests made in relation to Violent Crimes
violent_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Violent Crimes')
plt.show()
The figure above shows that there were very few arrests made in relation to violent crimes.
Creating subset for clustering
Here we are creating a subset of the violent crimes data set that includes the attributes 'Ward', 'IUCR' and 'District'.
sub_data = violent_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data = sub_data.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data['IUCR'] = sub_data.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data.head()
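Some IUCR codes carry a letter suffix (e.g. a code like '031A'), which is why a plain `astype(int)` would fail and the digit-extraction step above is needed. A small sketch (the codes are chosen for illustration):

```python
import pandas as pd

iucr = pd.Series(["031A", "0110", "041B"])
# keep only the leading digits, then cast: '031A' -> 31
codes = iucr.str.extract(r"(\d+)", expand=False).astype(int)
```

Note that the letter suffix is simply discarded, so distinct sub-codes can collapse to the same integer.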
Elbow curve
To decide how many clusters to use, we plot an elbow curve and look for the point where adding clusters stops improving the score appreciably.
Here the data is not normalized.
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data).score(sub_data) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
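A note on the score used above: `KMeans.score` returns the negative of the model's inertia (the sum of squared distances from each point to its nearest centroid), which is why the elbow curve rises toward zero as clusters are added. A quick check on random data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 2)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# score() on the training data is the negative inertia
same = np.isclose(km.score(X), -km.inertia_)
```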
Here the optimal number of clusters is 5 when the data is not normalized. For K-means, however, the data should be normalized first; otherwise the feature with the largest value range dominates the Euclidean distances the algorithm minimizes.
We will still apply K-means to the unnormalized data to see the difference between clustering with and without normalization.
km = KMeans(n_clusters=5)
km.fit(sub_data)
y = km.predict(sub_data)
labels = km.labels_
sub_data['Cluster'] = y
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data['Ward'])
y = np.array(sub_data['IUCR'])
z = np.array(sub_data['District'])
ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data["Cluster"], s=60, cmap="jet")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
Now we normalize the data so that no single feature dominates the Euclidean distances used by the clustering.
sub_data['IUCR'] = (sub_data['IUCR'] - sub_data['IUCR'].min())/(sub_data['IUCR'].max()-sub_data['IUCR'].min())
sub_data['Ward'] = (sub_data['Ward'] - sub_data['Ward'].min())/(sub_data['Ward'].max()-sub_data['Ward'].min())
sub_data['District'] = (sub_data['District'] - sub_data['District'].min())/(sub_data['District'].max()-sub_data['District'].min())
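The three lines above all apply the same min-max transformation. As a sketch, a hypothetical helper `minmax` (not part of the notebook) makes the mapping onto [0, 1] explicit:

```python
import pandas as pd

def minmax(s: pd.Series) -> pd.Series:
    """Linearly scale a Series so its minimum maps to 0 and its maximum to 1."""
    return (s - s.min()) / (s.max() - s.min())

districts = pd.Series([1, 5, 13, 25])
scaled = minmax(districts)  # 1 -> 0.0, 13 -> 0.5, 25 -> 1.0
```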
Now we again use the elbow method to find the optimal number of clusters for the normalized data.
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data).score(sub_data) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
del sub_data['Cluster']
From the above elbow curve, we can see the optimal number of clusters is 4.
Now we will apply KMeans.
km = KMeans(n_clusters=4)
km.fit(sub_data)
y = km.predict(sub_data)
labels = km.labels_
sub_data['Clusters'] = y
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data['Ward'])
y = np.array(sub_data['IUCR'])
z = np.array(sub_data['District'])
ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
Normalizing the time to lie between 0 and 1: low values correspond to midnight through early morning, medium values to the afternoon, and high values to the evening and night. Scaling also keeps K-means from clustering on time alone, since the unscaled seconds column would otherwise dominate the Euclidean distances.
violent_crimes_clean['Normalized_time'] = (violent_crimes_clean['seconds'] - violent_crimes_clean['seconds'].min())/(violent_crimes_clean['seconds'].max()-violent_crimes_clean['seconds'].min())
Now we will perform clustering using 'IUCR', 'Normalized_time' and 'District'
sub_data1 = violent_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data1['IUCR'] = sub_data1.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data1['IUCR'] = (sub_data1['IUCR'] - sub_data1['IUCR'].min())/(sub_data1['IUCR'].max()-sub_data1['IUCR'].min())
sub_data1['District'] = (sub_data1['District'] - sub_data1['District'].min())/(sub_data1['District'].max()-sub_data1['District'].min())
sub_data1.head()
Using Elbow method to determine optimal number of clusters
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data1).score(sub_data1) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
From the elbow curve we can see the optimal number of clusters is 4, so we now apply KMeans with 4 clusters.
km = KMeans(n_clusters=4)
km.fit(sub_data1)
y = km.predict(sub_data1)
labels = km.labels_
sub_data1['Clusters'] = y
sub_data1.head()
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data1['Normalized_time'])
y = np.array(sub_data1['IUCR'])
z = np.array(sub_data1['District'])
ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()
All these clusters look colorful, but what do they tell us? This is one of the biggest drawbacks of K-means: unless you know the data really well, the clusters are hard to interpret. Hence we apply Agglomerative Clustering to overcome this drawback and visualize the clusters as a heatmap so we can analyze the data better.
Standardizing the datetime for Agglomerative Clustering
from datetime import datetime
violent_crimes_clean['Date'] = pd.to_datetime(violent_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p')
crime_data['Date']= pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
for i in (violent_crimes_clean,crime_data):
i['year']=i.Date.dt.year
i['month']=i.Date.dt.month
i['day']=i.Date.dt.day
i['Hour']=i.Date.dt.hour
hour_by_type = violent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = violent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)
Once we have created the parameters for our cluster analysis we will now apply Agglomerative Clustering.
from sklearn.cluster import AgglomerativeClustering as AC
def scale_df(df,axis=0):
return (df - df.mean(axis=axis)) / df.std(axis=axis)
def plot_hmap(df, ix=None, cmap='PuRd'):
if ix is None:
ix = np.arange(df.shape[0])
plt.imshow(df.iloc[ix,:], cmap=cmap)
plt.colorbar(fraction=0.03)
plt.yticks(np.arange(df.shape[0]), df.index[ix])
plt.xticks(np.arange(df.shape[1]))
plt.grid(False)
plt.show()
def scale_and_plot(df, ix = None):
df_marginal_scaled = scale_df(df.T).T
if ix is None:
ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
plot_hmap(df_marginal_scaled, ix=ix)
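To see what `scale_and_plot` does before plotting, here is the same row-standardize-then-reorder logic on a toy pivot table: rows with similar hourly profiles get the same cluster label, and `argsort` on the labels places them next to each other in the heatmap. (The toy values are invented.)

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Toy pivot: three "crime types" x three "hours"; TYPE A and TYPE C share a profile
toy = pd.DataFrame([[10, 1, 1],
                    [1, 1, 10],
                    [9, 2, 1]],
                   index=["TYPE A", "TYPE B", "TYPE C"])

# Standardize each row, as scale_df(df.T).T does above
scaled = toy.sub(toy.mean(axis=1), axis=0).div(toy.std(axis=1), axis=0)

labels = AgglomerativeClustering(n_clusters=2).fit(scaled).labels_
order = labels.argsort()  # row order that groups same-cluster rows together
```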
To understand the cluster analysis better, we visualize the clusters using a heatmap.
We first cluster by hour of the day and Primary Type to determine at which hours crimes are most likely to occur.
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)
From the above figure we can answer the first question (What time of the day is the most patrolling required?) with regard to violent crimes:
Violent crimes are most likely to occur from the late evening into the early hours of the morning.
Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)
From the above figure we can answer the second question (Which area in the city is the most dangerous?) with regard to violent crimes:
The problem areas in the city are Districts 1 to 12 and 25, while Districts 13 to 24 are relatively safe.
Property Related Crimes
property_crimes = crime_data[col2]
property_crimes = property_crimes[property_crimes['Primary Type']\
.isin(['ARSON','BURGLARY','CRIMINAL DAMAGE','CRIMINAL TRESPASS', 'MOTOR VEHICLE THEFT', 'THEFT'])]
# clean some rogue (0,0) coordinates
property_crimes = property_crimes[property_crimes['X Coordinate']!=0]
property_crimes.head()
p = sns.lmplot(x="X Coordinate",
y="Y Coordinate",
col="Primary Type",
data=property_crimes.dropna(),
col_wrap=2, height=6, fit_reg=False,
sharey=False,
scatter_kws={"marker": "D",
"s": 10})
property_crimes.info()
property_crimes.isnull().sum()
property_crimes_clean = property_crimes.dropna()
property_crimes_clean.isnull().sum().sum()
Analysing all property related crime per district
property_crimes_clean = property_crimes_clean.loc[(property_crimes_clean['X Coordinate']!=0)]
sns.lmplot(x='X Coordinate',
y='Y Coordinate',
data=property_crimes_clean,
fit_reg=False,
hue="District",
palette='Dark2',
height=12,
scatter_kws={"marker": "D",
"s": 10})
ax = plt.gca()
ax.set_title("All Property Related Crimes (2002-2018) per District")
Doing some date time processing
property_crimes_clean['Date'] = pd.to_datetime(property_crimes_clean.Date)
property_crimes_clean['date'] = [d.date() for d in property_crimes_clean['Date']]
property_crimes_clean['time'] = [d.time() for d in property_crimes_clean['Date']]
property_crimes_clean['time'] = property_crimes_clean['time'].astype(str)
empty_list = []
for timestr in property_crimes_clean['time'].tolist():
ftr = [3600,60,1]
var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
empty_list.append(var)
property_crimes_clean['seconds'] = empty_list
Analysing the Property Related crime yearly
property_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Property Crimes')
plt.show()
From the figure above we can see that property related crimes were the highest in the year 2003 and lowest in the year 2015.
Analysing the Arrests made in relation to Property Related Crimes
property_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Property Crimes')
plt.show()
From the figure above we can see that there were very few arrests in relation to property-related crimes.
Creating a subset for clustering
sub_data_prop = property_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_prop = sub_data_prop.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_prop['IUCR'] = sub_data_prop.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data_prop.head()
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_prop).score(sub_data_prop) for i in range(len(kmeans))]
score
plt.plot(N,score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
km = KMeans(n_clusters=3)
km.fit(sub_data_prop)
y = km.predict(sub_data_prop)
labels = km.labels_
sub_data_prop['Cluster'] = y
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_prop['Ward'])
y = np.array(sub_data_prop['IUCR'])
z = np.array(sub_data_prop['District'])
ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_prop["Cluster"], s=60, cmap="jet")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
sub_data_prop['IUCR'] = (sub_data_prop['IUCR'] - sub_data_prop['IUCR'].min())/(sub_data_prop['IUCR'].max()-sub_data_prop['IUCR'].min())
sub_data_prop['Ward'] = (sub_data_prop['Ward'] - sub_data_prop['Ward'].min())/(sub_data_prop['Ward'].max()-sub_data_prop['Ward'].min())
sub_data_prop['District'] = (sub_data_prop['District'] - sub_data_prop['District'].min())/(sub_data_prop['District'].max()-sub_data_prop['District'].min())
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_prop).score(sub_data_prop) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
del sub_data_prop['Cluster']
km = KMeans(n_clusters=4)
km.fit(sub_data_prop)
y = km.predict(sub_data_prop)
labels = km.labels_
sub_data_prop['Clusters'] = y
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_prop['Ward'])
y = np.array(sub_data_prop['IUCR'])
z = np.array(sub_data_prop['District'])
ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_prop["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
property_crimes_clean['Normalized_time'] = (property_crimes_clean['seconds'] - property_crimes_clean['seconds'].min())/(property_crimes_clean['seconds'].max()-property_crimes_clean['seconds'].min())
sub_data_prop1 = property_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_prop1['IUCR'] = sub_data_prop1.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data_prop1['IUCR'] = (sub_data_prop1['IUCR'] - sub_data_prop1['IUCR'].min())/(sub_data_prop1['IUCR'].max()-sub_data_prop1['IUCR'].min())
sub_data_prop1['District'] = (sub_data_prop1['District'] - sub_data_prop1['District'].min())/(sub_data_prop1['District'].max()-sub_data_prop1['District'].min())
sub_data_prop1.head()
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_prop1).score(sub_data_prop1) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
km = KMeans(n_clusters=5)
km.fit(sub_data_prop1)
y = km.predict(sub_data_prop1)
labels = km.labels_
sub_data_prop1['Clusters'] = y
sub_data_prop1.head()
#Plotting the results of 5 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_prop1['Normalized_time'])
y = np.array(sub_data_prop1['IUCR'])
z = np.array(sub_data_prop1['District'])
ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_prop1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()
Standardizing the datetime for Agglomerative Clustering
from datetime import datetime
property_crimes_clean['Date'] = pd.to_datetime(property_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p')
crime_data['Date']= pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
for i in (property_crimes_clean,crime_data):
i['year']=i.Date.dt.year
i['month']=i.Date.dt.month
i['day']=i.Date.dt.day
i['Hour']=i.Date.dt.hour
hour_by_type = property_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = property_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)
Implementing Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering as AC
def scale_df(df,axis=0):
return (df - df.mean(axis=axis)) / df.std(axis=axis)
def plot_hmap(df, ix=None, cmap='PuRd'):
if ix is None:
ix = np.arange(df.shape[0])
plt.imshow(df.iloc[ix,:], cmap=cmap)
plt.colorbar(fraction=0.03)
plt.yticks(np.arange(df.shape[0]), df.index[ix])
plt.xticks(np.arange(df.shape[1]))
plt.grid(False)
plt.show()
def scale_and_plot(df, ix = None):
df_marginal_scaled = scale_df(df.T).T
if ix is None:
ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
plot_hmap(df_marginal_scaled, ix=ix)
Visualizing the clusters using heatmaps
We first cluster by hour of the day and Primary Type to determine at which hours crimes are most likely to occur.
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)
From the above figure we can answer the first question (What time of the day is the most patrolling required?) with regard to property-related crimes:
Property-related crimes are most likely to occur from midday until about 1 am.
Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.
CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)
From the above figure we can answer the second question (Which area in the city is the most dangerous?) with regard to property-related crimes:
The problem areas in the city are Districts 1 to 12 and 25, while Districts 13 to 24 are relatively safe.
Gang Related Crimes
gang_crimes = crime_data[col2]
gang_crimes = gang_crimes[gang_crimes['Primary Type']\
.isin(['HOMICIDE','CONCEALED CARRY LICENSE VIOLATION','NARCOTICS','WEAPONS VIOLATION'])]
# clean some rogue (0,0) coordinates
gang_crimes = gang_crimes[gang_crimes['X Coordinate']!=0]
gang_crimes.head()
h = sns.lmplot(x="X Coordinate",
y="Y Coordinate",
col="Primary Type",
data=gang_crimes.dropna(),
col_wrap=2, height=6, fit_reg=False,
sharey=False,
scatter_kws={"marker": "D",
"s": 10})
gang_crimes.info()
gang_crimes.isnull().sum()
gang_crimes_clean = gang_crimes.dropna()
gang_crimes_clean.isnull().sum().sum()
Analysing all gang-related crimes per district
gang_crimes_clean = gang_crimes_clean.loc[(gang_crimes_clean['X Coordinate']!=0)]
sns.lmplot(x='X Coordinate',
y='Y Coordinate',
data=gang_crimes_clean,
fit_reg=False,
hue="District",
palette='Dark2',
height=12,
scatter_kws={"marker": "D",
"s": 10})
ax = plt.gca()
ax.set_title("All Gang Related Crimes (2002-2018) per District")
Doing the same date-time processing
gang_crimes_clean['Date'] = pd.to_datetime(gang_crimes_clean.Date)
gang_crimes_clean['date'] = [d.date() for d in gang_crimes_clean['Date']]
gang_crimes_clean['time'] = [d.time() for d in gang_crimes_clean['Date']]
gang_crimes_clean['time'] = gang_crimes_clean['time'].astype(str)
empty_list = []
for timestr in gang_crimes_clean['time'].tolist():
ftr = [3600,60,1]
var = sum([a*b for a,b in zip(ftr, map(int,timestr.split(':')))])
empty_list.append(var)
gang_crimes_clean['seconds'] = empty_list
Analysis of Gang Related Crimes yearly
gang_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Gang Crimes')
plt.show()
From the figure above we can conclude that gang-related crimes were highest in the year 2004 and lowest in the year 2017.
Analysis of Arrests in relation to Gang Related Crimes
gang_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Gang Crimes')
plt.show()
The figure above shows that a relatively large share of gang-related crimes resulted in arrests.
Implementing K-means clustering
sub_data_gang = gang_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_gang = sub_data_gang.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_gang['IUCR'] = sub_data_gang.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data_gang.head()
sub_data_gang['IUCR'] = (sub_data_gang['IUCR'] - sub_data_gang['IUCR'].min())/(sub_data_gang['IUCR'].max()-sub_data_gang['IUCR'].min())
sub_data_gang['Ward'] = (sub_data_gang['Ward'] - sub_data_gang['Ward'].min())/(sub_data_gang['Ward'].max()-sub_data_gang['Ward'].min())
sub_data_gang['District'] = (sub_data_gang['District'] - sub_data_gang['District'].min())/(sub_data_gang['District'].max()-sub_data_gang['District'].min())
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_gang).score(sub_data_gang) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
km = KMeans(n_clusters=4)
km.fit(sub_data_gang)
y = km.predict(sub_data_gang)
labels = km.labels_
sub_data_gang['Clusters'] = y
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_gang['Ward'])
y = np.array(sub_data_gang['IUCR'])
z = np.array(sub_data_gang['District'])
ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_gang["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
gang_crimes_clean['Normalized_time'] = (gang_crimes_clean['seconds'] - gang_crimes_clean['seconds'].min())/(gang_crimes_clean['seconds'].max()-gang_crimes_clean['seconds'].min())
sub_data_gang1 = gang_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_gang1['IUCR'] = sub_data_gang1.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data_gang1['IUCR'] = (sub_data_gang1['IUCR'] - sub_data_gang1['IUCR'].min())/(sub_data_gang1['IUCR'].max()-sub_data_gang1['IUCR'].min())
sub_data_gang1['District'] = (sub_data_gang1['District'] - sub_data_gang1['District'].min())/(sub_data_gang1['District'].max()-sub_data_gang1['District'].min())
sub_data_gang1.head()
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_gang1).score(sub_data_gang1) for i in range(len(kmeans))]
score
pl.plot(N,score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
km = KMeans(n_clusters=5)
km.fit(sub_data_gang1)
y = km.predict(sub_data_gang1)
labels = km.labels_
sub_data_gang1['Clusters'] = y
sub_data_gang1.head()
#Plotting the results of 5 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_gang1['Normalized_time'])
y = np.array(sub_data_gang1['IUCR'])
z = np.array(sub_data_gang1['District'])
ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_gang1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()
Standardizing datetime for Agglomerative Clustering
from datetime import datetime
gang_crimes_clean['Date'] = pd.to_datetime(gang_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p')
crime_data['Date']= pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
for i in (gang_crimes_clean,crime_data):
i['year']=i.Date.dt.year
i['month']=i.Date.dt.month
i['day']=i.Date.dt.day
i['Hour']=i.Date.dt.hour
hour_by_type = gang_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc=np.size).fillna(0)
hour_by_district = gang_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc=np.size).fillna(0)
Implementing Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering as AC
def scale_df(df,axis=0):
return (df - df.mean(axis=axis)) / df.std(axis=axis)
def plot_hmap(df, ix=None, cmap='PuRd'):
if ix is None:
ix = np.arange(df.shape[0])
plt.imshow(df.iloc[ix,:], cmap=cmap)
plt.colorbar(fraction=0.03)
plt.yticks(np.arange(df.shape[0]), df.index[ix])
plt.xticks(np.arange(df.shape[1]))
plt.grid(False)
plt.show()
def scale_and_plot(df, ix = None):
df_marginal_scaled = scale_df(df.T).T
if ix is None:
ix = AC(4).fit(df_marginal_scaled).labels_.argsort()
cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
df_marginal_scaled = np.clip(df_marginal_scaled, -1*cap, cap)
plot_hmap(df_marginal_scaled, ix=ix)
Visualizing the clusters using heatmaps
We first cluster by hour of the day and Primary Type to determine at which hours crimes are most likely to occur.
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)
From the figure above we can answer the first question (What time of day is the most patrolling required?) for gang-related crimes:
Gang-related crimes are most likely to occur between 6 pm and roughly 2 am.
Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)
From the figure above we can answer the second question (Which area in the city is the most dangerous?) for gang-related crimes:
The problem areas are Districts 3 to 11, 15, and 25; the relatively safe areas are Districts 1, 2, 13, 14, and 16 to 24.
Sex Crimes
sex_crimes = crime_data[col2]
sex_crimes = sex_crimes[sex_crimes['Primary Type']\
.isin(['CRIM SEXUAL ASSAULT','OBSCENITY','PROSTITUTION','PUBLIC INDECENCY', 'SEX OFFENSE', 'OFFENSE INVOLVING CHILDREN'])]
# clean some rogue (0,0) coordinates
sex_crimes = sex_crimes[sex_crimes['X Coordinate']!=0]
sex_crimes.head()
s = sns.lmplot(x="X Coordinate",
               y="Y Coordinate",
               col="Primary Type",
               data=sex_crimes.dropna(),
               col_wrap=2, height=6, fit_reg=False,
               sharey=False,
               scatter_kws={"marker": "D",
                            "s": 10})
sex_crimes.info()
sex_crimes.isnull().sum()
sex_crimes_clean = sex_crimes.dropna()
sex_crimes_clean.isnull().sum().sum()
Analysing the Sex crimes per district
sex_crimes_clean = sex_crimes_clean.loc[(sex_crimes_clean['X Coordinate']!=0)]
sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=sex_crimes_clean,
           fit_reg=False,
           hue="District",
           palette='Dark2',
           height=12,
           scatter_kws={"marker": "D",
                        "s": 10})
ax = plt.gca()
ax.set_title("All Sex Crimes (2001-present) per District")
Doing some date time processing
sex_crimes_clean['Date'] = pd.to_datetime(sex_crimes_clean.Date)
sex_crimes_clean['date'] = [d.date() for d in sex_crimes_clean['Date']]
sex_crimes_clean['time'] = [d.time() for d in sex_crimes_clean['Date']]
sex_crimes_clean['time'] = sex_crimes_clean['time'].astype(str)
empty_list = []
ftr = [3600, 60, 1]  # weights for hours, minutes, seconds
for timestr in sex_crimes_clean['time'].tolist():
    var = sum(a * b for a, b in zip(ftr, map(int, timestr.split(':'))))
    empty_list.append(var)
sex_crimes_clean['seconds'] = empty_list
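The loop above converts each 'HH:MM:SS' string by hand. Since 'Date' is already a datetime column at this point, the same seconds-since-midnight value can be computed vectorised from the dt accessor, avoiding the string round-trip (the frame below is a made-up stand-in):

```python
import pandas as pd

# Hypothetical two-row stand-in for sex_crimes_clean
df = pd.DataFrame({'Date': pd.to_datetime(['2017-01-01 13:30:15',
                                           '2017-01-01 00:05:00'])})

# Seconds since midnight, directly from the datetime components
df['seconds'] = (df.Date.dt.hour * 3600
                 + df.Date.dt.minute * 60
                 + df.Date.dt.second)
print(df['seconds'].tolist())  # 13:30:15 -> 48615, 00:05:00 -> 300
```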
Analysing the Sex Crimes yearly
sex_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Sex Crimes')
plt.show()
From the figure above we can conclude that sex crimes peaked in 2004 and were lowest in 2017.
Analysing the Arrests in relation to Sex Crimes
sex_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Sex Crimes')
plt.show()
The figure above shows that, for sex crimes, incidents with an arrest outnumber those without.
Implementing k-means
sub_data_sex = sex_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_sex = sub_data_sex.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_sex['IUCR'] = sub_data_sex.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
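IUCR codes are strings such as '0110' or '031A'. The regex keeps only the leading run of digits so the column can be cast to int; note that codes with a letter suffix lose it, and leading zeros disappear in the cast. Toy values:

```python
import pandas as pd

s = pd.Series(['0110', '031A', '1753'])  # made-up IUCR-style codes
digits = s.str.extract(r'(\d+)', expand=False).astype(int)
print(digits.tolist())  # '0110' -> 110, '031A' -> 31, '1753' -> 1753
```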
sub_data_sex.head()
sub_data_sex['IUCR'] = (sub_data_sex['IUCR'] - sub_data_sex['IUCR'].min())/(sub_data_sex['IUCR'].max()-sub_data_sex['IUCR'].min())
sub_data_sex['Ward'] = (sub_data_sex['Ward'] - sub_data_sex['Ward'].min())/(sub_data_sex['Ward'].max()-sub_data_sex['Ward'].min())
sub_data_sex['District'] = (sub_data_sex['District'] - sub_data_sex['District'].min())/(sub_data_sex['District'].max()-sub_data_sex['District'].min())
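The three lines above apply min-max scaling, x' = (x - min) / (max - min), so every feature lands in [0, 1] and no single column dominates the k-means distances. sklearn's MinMaxScaler does the same thing in one call (toy values below):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for two of the sub_data_sex columns
df = pd.DataFrame({'Ward': [1.0, 25.0, 50.0], 'District': [1.0, 12.0, 25.0]})

# Equivalent to the per-column (x - min) / (max - min) lines above
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled)
```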
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_sex).score(sub_data_sex) for i in range(len(kmeans))]
score
plt.plot(N, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
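KMeans.score returns the negative inertia (within-cluster sum of squares), so the curve rises toward zero and the elbow is where it flattens. A compact version of the same sweep on synthetic blobs with a known cluster count of 3:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Inertia for k = 1..6; score() is simply -inertia
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 7)]

# Inertia keeps falling with k, but the drop slows sharply after k = 3
print([round(v) for v in inertias])
```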
km = KMeans(n_clusters=3)
km.fit(sub_data_sex)
y = km.predict(sub_data_sex)
labels = km.labels_
sub_data_sex['Clusters'] = y
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_sex['Ward'])
y = np.array(sub_data_sex['IUCR'])
z = np.array(sub_data_sex['District'])
ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_sex["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
sex_crimes_clean['Normalized_time'] = (sex_crimes_clean['seconds'] - sex_crimes_clean['seconds'].min())/(sex_crimes_clean['seconds'].max()-sex_crimes_clean['seconds'].min())
sub_data_sex1 = sex_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_sex1['IUCR'] = sub_data_sex1.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data_sex1['IUCR'] = (sub_data_sex1['IUCR'] - sub_data_sex1['IUCR'].min())/(sub_data_sex1['IUCR'].max()-sub_data_sex1['IUCR'].min())
sub_data_sex1['District'] = (sub_data_sex1['District'] - sub_data_sex1['District'].min())/(sub_data_sex1['District'].max()-sub_data_sex1['District'].min())
sub_data_sex1.head()
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_sex1).score(sub_data_sex1) for i in range(len(kmeans))]
score
plt.plot(N, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
km = KMeans(n_clusters=4)
km.fit(sub_data_sex1)
y = km.predict(sub_data_sex1)
labels = km.labels_
sub_data_sex1['Clusters'] = y
sub_data_sex1.head()
#Plotting the results of 4 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_sex1['Normalized_time'])
y = np.array(sub_data_sex1['IUCR'])
z = np.array(sub_data_sex1['District'])
ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_sex1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()
Standardizing datetime
from datetime import datetime
sex_crimes_clean['Date'] = pd.to_datetime(sex_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p')
crime_data['Date']= pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
for i in (sex_crimes_clean, crime_data):
    i['year'] = i.Date.dt.year
    i['month'] = i.Date.dt.month
    i['day'] = i.Date.dt.day
    i['Hour'] = i.Date.dt.hour
hour_by_type = sex_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc='size').fillna(0)
hour_by_district = sex_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc='size').fillna(0)
Implementing Agglomerative clustering
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df, axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)

def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix, :], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()

def scale_and_plot(df, ix=None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(n_clusters=4).fit(df_marginal_scaled).labels_.argsort()
    cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1 * cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)
From the figure above we can answer the first question (What time of day is the most patrolling required?) for sex crimes:
Sex crimes are most likely to occur between 10 pm and roughly 2 am.
Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)
The figure above shows that the relatively safe districts for sex crimes are 1, 13, 16, 17, 20, 21, and 23.
Non-Violent Crimes
nviolent_crimes = crime_data[col2]
nviolent_crimes = nviolent_crimes[nviolent_crimes['Primary Type']\
.isin(['DECEPTIVE PRACTICE','GAMBLING','INTERFERENCE WITH PUBLIC OFFICER','INTIMIDATION', 'LIQUOR LAW VIOLATION', 'OTHER NARCOTIC VIOLATION', 'OTHER OFFENSE', 'PUBLIC PEACE VIOLATION', 'RITUALISM', 'STALKING'])]
# clean some rogue (0,0) coordinates
nviolent_crimes = nviolent_crimes[nviolent_crimes['X Coordinate']!=0]
nviolent_crimes.head()
nv = sns.lmplot(x="X Coordinate",
                y="Y Coordinate",
                col="Primary Type",
                data=nviolent_crimes.dropna(),
                col_wrap=2, height=6, fit_reg=False,
                sharey=False,
                scatter_kws={"marker": "D",
                             "s": 10})
nviolent_crimes.info()
nviolent_crimes.isnull().sum()
nviolent_crimes_clean = nviolent_crimes.dropna()
nviolent_crimes_clean.isnull().sum().sum()
Analysing all Non-Violent Crimes per district
nviolent_crimes_clean = nviolent_crimes_clean.loc[(nviolent_crimes_clean['X Coordinate']!=0)]
sns.lmplot(x='X Coordinate',
           y='Y Coordinate',
           data=nviolent_crimes_clean,
           fit_reg=False,
           hue="District",
           palette='Dark2',
           height=12,
           scatter_kws={"marker": "D",
                        "s": 10})
ax = plt.gca()
ax.set_title("All Non-Violent Crimes (2001-present) per District")
Doing some date time processing
nviolent_crimes_clean['Date'] = pd.to_datetime(nviolent_crimes_clean.Date)
nviolent_crimes_clean['date'] = [d.date() for d in nviolent_crimes_clean['Date']]
nviolent_crimes_clean['time'] = [d.time() for d in nviolent_crimes_clean['Date']]
nviolent_crimes_clean['time'] = nviolent_crimes_clean['time'].astype(str)
empty_list = []
ftr = [3600, 60, 1]  # weights for hours, minutes, seconds
for timestr in nviolent_crimes_clean['time'].tolist():
    var = sum(a * b for a, b in zip(ftr, map(int, timestr.split(':'))))
    empty_list.append(var)
nviolent_crimes_clean['seconds'] = empty_list
Analysing the Non-Violent Crimes yearly
nviolent_crimes_clean['Year'].value_counts().plot(kind='bar')
plt.title('Analysis of crime yearly')
plt.xlabel('Year')
plt.ylabel('Non-Violent Crimes')
plt.show()
The figure above shows that non-violent crimes peaked in 2003 and were lowest in 2015.
Analysis of Arrests in Non-Violent Crimes
nviolent_crimes_clean['Arrest'].value_counts().plot(kind='bar')
plt.title('Arrests')
plt.xlabel('Arrests')
plt.ylabel('Non-Violent Crimes')
plt.show()
The figure above shows that most non-violent crime cases did not lead to an arrest.
Implementing k-means
sub_data_nviolent = nviolent_crimes_clean[['Ward', 'IUCR', 'District']]
sub_data_nviolent = sub_data_nviolent.apply(lambda x:x.fillna(x.value_counts().index[0]))
sub_data_nviolent['IUCR'] = sub_data_nviolent.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data_nviolent.head()
sub_data_nviolent['IUCR'] = (sub_data_nviolent['IUCR'] - sub_data_nviolent['IUCR'].min())/(sub_data_nviolent['IUCR'].max()-sub_data_nviolent['IUCR'].min())
sub_data_nviolent['Ward'] = (sub_data_nviolent['Ward'] - sub_data_nviolent['Ward'].min())/(sub_data_nviolent['Ward'].max()-sub_data_nviolent['Ward'].min())
sub_data_nviolent['District'] = (sub_data_nviolent['District'] - sub_data_nviolent['District'].min())/(sub_data_nviolent['District'].max()-sub_data_nviolent['District'].min())
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_nviolent).score(sub_data_nviolent) for i in range(len(kmeans))]
score
plt.plot(N, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
km = KMeans(n_clusters=4)
km.fit(sub_data_nviolent)
y = km.predict(sub_data_nviolent)
labels = km.labels_
sub_data_nviolent['Clusters'] = y
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_nviolent['Ward'])
y = np.array(sub_data_nviolent['IUCR'])
z = np.array(sub_data_nviolent['District'])
ax.set_xlabel('Ward')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_nviolent["Clusters"], s=60, cmap="winter")
ax.view_init(azim=0)
#print(ax.azim)
plt.show()
nviolent_crimes_clean['Normalized_time'] = (nviolent_crimes_clean['seconds'] - nviolent_crimes_clean['seconds'].min())/(nviolent_crimes_clean['seconds'].max()-nviolent_crimes_clean['seconds'].min())
sub_data_nviolent1 = nviolent_crimes_clean[['IUCR', 'Normalized_time', 'District']]
sub_data_nviolent1['IUCR'] = sub_data_nviolent1.IUCR.str.extract(r'(\d+)', expand=False).astype(int)
sub_data_nviolent1['IUCR'] = (sub_data_nviolent1['IUCR'] - sub_data_nviolent1['IUCR'].min())/(sub_data_nviolent1['IUCR'].max()-sub_data_nviolent1['IUCR'].min())
sub_data_nviolent1['District'] = (sub_data_nviolent1['District'] - sub_data_nviolent1['District'].min())/(sub_data_nviolent1['District'].max()-sub_data_nviolent1['District'].min())
sub_data_nviolent1.head()
N = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in N]
kmeans
score = [kmeans[i].fit(sub_data_nviolent1).score(sub_data_nviolent1) for i in range(len(kmeans))]
score
plt.plot(N, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
km = KMeans(n_clusters=4)
km.fit(sub_data_nviolent1)
y = km.predict(sub_data_nviolent1)
labels = km.labels_
sub_data_nviolent1['Clusters'] = y
sub_data_nviolent1.head()
#Plotting the results of 4 clusters
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')
x = np.array(sub_data_nviolent1['Normalized_time'])
y = np.array(sub_data_nviolent1['IUCR'])
z = np.array(sub_data_nviolent1['District'])
ax.set_xlabel('Time')
ax.set_ylabel('IUCR')
ax.set_zlabel('District')
ax.scatter(x,y,z, marker="o", c = sub_data_nviolent1["Clusters"], s=60, cmap="jet")
ax.view_init(azim=-20)
#print(ax.azim)
plt.show()
Standardizing datetime
from datetime import datetime
nviolent_crimes_clean['Date'] = pd.to_datetime(nviolent_crimes_clean.Date,format='%m/%d/%Y %I:%M:%S %p')
crime_data['Date']= pd.to_datetime(crime_data.Date,format='%m/%d/%Y %I:%M:%S %p')
for i in (nviolent_crimes_clean, crime_data):
    i['year'] = i.Date.dt.year
    i['month'] = i.Date.dt.month
    i['day'] = i.Date.dt.day
    i['Hour'] = i.Date.dt.hour
hour_by_type = nviolent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='Hour', aggfunc='size').fillna(0)
hour_by_district = nviolent_crimes_clean.pivot_table(values='ID', index='Primary Type', columns='District', aggfunc='size').fillna(0)
Implementing Agglomerative clustering
from sklearn.cluster import AgglomerativeClustering as AC

def scale_df(df, axis=0):
    return (df - df.mean(axis=axis)) / df.std(axis=axis)

def plot_hmap(df, ix=None, cmap='PuRd'):
    if ix is None:
        ix = np.arange(df.shape[0])
    plt.imshow(df.iloc[ix, :], cmap=cmap)
    plt.colorbar(fraction=0.03)
    plt.yticks(np.arange(df.shape[0]), df.index[ix])
    plt.xticks(np.arange(df.shape[1]))
    plt.grid(False)
    plt.show()

def scale_and_plot(df, ix=None):
    df_marginal_scaled = scale_df(df.T).T
    if ix is None:
        ix = AC(n_clusters=4).fit(df_marginal_scaled).labels_.argsort()
    cap = np.min([np.max(df_marginal_scaled.to_numpy()), np.abs(np.min(df_marginal_scaled.to_numpy()))])
    df_marginal_scaled = np.clip(df_marginal_scaled, -1 * cap, cap)
    plot_hmap(df_marginal_scaled, ix=ix)
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_type)
From the figure above we can answer the first question (What time of day is the most patrolling required?) for non-violent crimes:
Non-violent crimes are spread across the day, occurring most often from 10 am until roughly 1 am.
Next we cluster with Primary Type and District to determine which crime is more likely to occur in which district.
#CMAP = 'PuRd'
plt.figure(figsize=(20,10))
scale_and_plot(hour_by_district)
The figure above shows that the relatively safe districts for non-violent crimes are 1, 2, 14 to 22, and 24.